perm filename SPINDL.REM[UP,DOC]1 blob
sn#214812 filedate 1976-05-11 generic text, type C, neo UTF8
COMMENT ⊗ VALID 00002 PAGES
C REC PAGE DESCRIPTION
C00001 00001
C00002 00002 .BEGIN
C00020 ENDMK
C⊗;
.BEGIN
.REQUIRE "MEMO.PUB[LET,JMC]" SOURCE;
.cb SPINDL - A PROGRAM FOR FILE COMPRESSION
.CB "by Robert E. Maas (REM)"
.CB "(This file is SPINDL.REM[UP,DOC])"
Disk space in the Lab is in short supply and will
remain that way for the forseeable future. Therefore, two
methods of reducing the amount of space taken by files are
hereby offered. The first of these allows combining several
files into one, and the second uses Huffman coding to further
reduce the space taken to slightly less than half (in most
cases) its original volume. The two facilities are offered by
the same program called SPINDL.
Each disk file consumes at least a full disk block
(2304 words), even if it contains only one word of actual
data. Most such files are text files, and can now be
combined into a "spindle" so that many fewer disk blocks are
consumed. Later, as desired, individual files in the spindle
may be retrieved. Since the system DIrectory command measures
files in words rather than blocks, SPINDL will help the system
even if the DIrectory command shows no reduction of word
count.
To use this new facility, type the monitor command
R SPINDL
The program will announce itself and ask you for the name of
the spindle file you want to use, then prompt you with a ">"
at the left margin. You type a file name, default extension
is ".SPI". If the file doesn't already exist, it will be
created. The program will now announce the number of files
currently in the spindle, and print a "*" at the left margin
to indicate that it is waiting for a command. If you type a
"?" followed by carriage-return, it will provide a list of
legal commands. The following commands will perform most of
the useful functions you will want:
DIrectory -- tells you all files in the spindle.
Spindle -- copies a file from the disk into the spindle.
(Note, the original remains on the disk, so at present the
user must manually delete it after spindling if he is to
reduce disk-space consumption.)
Unspindle -- copies a file from the spindle to the disk.
(Note, original remains in spindle until explicitly deleted.)
DElete -- deletes a file from the spindle (without reclaiming
space).
Bubble -- copies the entire spindle, deleting all wasted space
such as deleted files.
After some of the commands (for example, Spindle,
Unspindle, and DElete) the program will ask you for various
items, such as the names of the files to use, and the prompt
for typing each such item will be a ">" at the left margin.
When it is all done, it will prompt "*" at the left margin
indicating it is ready for another top-level command.
CRUNCH-AND-SPINDLE:
Greater data compression at the cost of more computer
time is obtained with the SPINDL command "Crunch". (It uses
Huffman coding to reduce the volume of an English or FAIL or
SAIL, etc. file by a further factor of two). SPINDL will ask
you for auxiliary files called the history tree and the
Huffman tree, but carriage return will use default trees (if
the spindle already contains a crunched file, default is the
same trees as used the previous time, otherwise default is a
particular pair of trees on [UP,REM] which work quite well
with most English language text such as writeups and essays).
Regardless of whether the default trees or some other trees
are used, one copy of each tree actualy used will be copied
into your spindle file (if it isn't there already), so that
crunching only achieves a reduction of data if the files
being crunched with a particular pair of trees total at least
1400-3000 words (assuming a 2-to-1 compression and assuming
700-1500 word trees). Naturally, it is also pointless to spindle a
single file without crunching, or to crunch a single file unless its size exceeds
2.3K.
Appendix A explains how to make special trees,
assuming none of the already-existing trees listed in Appendix
B are adequate. (Currently only trees for English exist, but
soon an attempt will be made to create trees for FAIL, SAIL,
LISP, and perhaps LAP and SNOBOL.)
To uncrunch-and-unspindle, just use the "Unspindle"
command and the right thing will happen automatically, the
correct trees will be selected from the spindle file according
to their hash number and will used for uncrunching your file.
Timing:
Spindle -- 3/4 second per thousand words.
Unspindle -- 1/2 second per thousand words.
Crunch-by-pages-and-spindle -- 4 sec to load default
trees + 2 sec per thousand words.
Uncrunch-by-pages-and-spindle -- 3.3 sec to load trees
+ 1 sec per thousand words.
When should a file be SPINDLed? Our current
recommendation is that a file not expected to be used in 30
days should be spindled, and if not expected to be used for
another 15 days, crunched too. These recommendations are
conservative, and will change in the direction of more prompt
spindling when the KL-10 is working.
Many people will be able to double the effective sizes
of their disk allocations by using these facilities.
.skip
.CB APPENDIX A
.at 8 ⊂ BREAK; ONCE INDENT 8; ⊃;
At present it is rather a hassle to create new history
and Huffman trees for encoding a file, however here is an
explanation of how to do it should you really want to. If you
already have a list of reversed strings to use, or want to use
one of REM's, skip over steps 1-5 and begin at step 6. If you
already have the two Polish files you want to use for
crunching, you may begin at step 7.
STEP 1 -- Create a list of strings occurring most often:
RU CRU1[1,REM] -- Starts up the survey program.
B -- Sets the program to Bigram/Trigram mode in which
it scans a file for the strings which occur most often.
The program will ask you for input file name and write
TOKENS.DAT.
STEP 2 -- Modify the list of strings using your favorite
editor, employing whatever heuristics you have for eliminating
undesired strings or adding strings you believe will be
useful.
STEP 3 -- Reverse the list of strings so that they read
backwards (from a given point in a file back to earlier
characters):
RU CRU1[1,REM]
R -- Sets the program to Reverse mode.
TOKENS.DAT -- Tells it the name of the file to read.
The program will write SNEKOT.DAT.
STEP 4 -- Separate the three parts of the file and sort them
individually into lexicographic sequence:
ET SNEKOT.DAT/N
Locate the point where the strings change from
two-characters to one-character, and the point where they
change from one-character to three-characters. Put a page
mark at each break.
E -- Exit the editor.
DO SNE[1,REM] -- Selects the correct pages, creating
SN.1 SN.2 and SN.3 containing strings of a particular length,
assuming your edit was correct.
R SSORT -- Starts up the String-Sorter program.
SN.2 -- Sort the 2-letter strings, when it asks you if
it is ok to overwrite, say Y.
R SSORT
SN.3
COPY <FILE>/N/L←SN.1,SN.2,SN.3 -- Whatever <FILE> you
want, check it to be sure the strings are sorted by length,
and in each group alphanumerically.
STEP 5 -- Modify the list of reversed strings again if
desired. Seeing them alphabetized according to their new
backwards-format may suggest additional heuristics for purge
or addition.
[If you jumped over steps 1-5, there is a spindle called
SNEKOT.SPI[1,REM] which contains various lists of reversed
strings you may want to use.]
STEP 6 -- Scan your collection of files according to the list
of left-contexts defined by the reversed strings you have
created or found:
RU HOTER[HOT,REM] or R PTYJOB -- whichever your
favorite pty-job program, have it output a disk file so you
can save all the TTY output containing the Huffman code, so
that you can examine it sometime later to try to figure out
how to improve it by deleting or inserting left-contexts. If
you are sure you won't want to examine the code, omit this
PTYJOB/HOTER part and just do the following directly on your
terminal.
RU CRU1[1,REM]
S -- Sets the program to Scan-by-Reversed-Left-Context
mode.
<FILE> -- The list-of-left-contexts file you
laboriously created.
The program will ask you one at a time for files to
survey according to the list of left-contexts that were read
in, compiling a complete '200 word table of how many times
each ASCII character occurs in that left-context. When you
are finished with all the files you want to include, type a
blank line instead of a file name.
The program will now type out all the totals, which
you want to suppress by control-o or escape-o depending on the
type of terminal you are on.
After all that, output will be turned back on by the
program, it will write out the file HIST.POL which is a Polish
description of all the left-contexts (typically this file is
between 50 and 200 words).
After that, the program will begin computing an
abbreviated Huffman code for each left-context, and type on
the TTY or PTY the input bits, output bits, compression-ratio,
and a complete description of the Huffman code in nice
readable verbose format. If you are running this through
HOTER you will probably want to do a local ↑O command to
suppress output to TTY without affecting the copying of PTY
output to the disk file. (PTYJOB does not provide this
facility, only HOTER.) This process will probably take about
15 minutes because of the volume of data being inefficiently
copied, all the while the file HUFFS.POL will be written
containing each Huffman code generated (typically about
500-2000 words total).
You now have HIST.POL and HUFFS.POL containing a
complete description of the code to be used.
[If you jumped directly here, there are several useful history
and Huffman trees on [UP,REM] with extensions *.HIS and *.HUF
respectively.]
STEP 7 -- When using the Crunch-by-pages-and-spindle command
in SPINDL, specify the files you created in step 6 or the
files you found already existing *.HIS and *.HUF -- only one
copy of each different tree will be kept in the spindle that
contains files crunched by it.
.skip
.CB APPENDIX B
All existing trees are in the disk area [UP,REM].
History trees have extension .HIS and Huffman trees have
extension .HUF. The following table tells which are available
and what files they are good for.
.END
.BEGIN NOFILL
.FONT A "GACL25";SELECT A
*.HIS *.HUF type of file it is for
NOTSEV NOTIC1 English text (surveyed NOTICE[UP,DOC]) (DEFAULT TREES)
NOTSEV SEVEN1 English text (same left-contexts but larger survey)
.END